Current immune receptor sequences bioinformatic tools are designed for Illumina reads, whose sequencing error pattern differs from that of the technology used to generate our data, Ion Torrent. This sequencing platform has a relatively high insertion and deletion error rate, especially after homopolymers.
Figure 1: Estimation of the error rates of insertions, deletions, and substitutions in 19 PGM data sets for the HBV-RT region grouped by regions (homopolymer, non-homopolymer and total). Taken from Song, Liting, et al. "Comparison of error correction algorithms for Ion Torrent PGM data: application to hepatitis B virus." Scientific reports 7.1 (2017): 1-11.
Observations:
A JUNCTION corresponds to the region that we previously called CDR3 and now we refer to it as hypervariable region (HVR).
Figure 2: Schematic representation of an IG V–D–J-REGION with its JUNCTION. A JUNCTION may be particularly complex with up to nine regions: 3' end of V-REGION, D-REGIONs (D1, D2 or D3, if several), N-REGIONs (N1, N2, N3 or N4, if several), and 5' end of J-REGION, to which can be added P-REGIONs. Taken from Monod, M. Y., Giudicelli, V., Chaume, D., & Lefranc, M. P. (2004). IMGT/JunctionAnalysis: the first tool for the analysis of the immunoglobulin and T cell receptor complex V–J and V–D–J JUNCTIONs. Bioinformatics, 20(suppl_1), i379-i385.
JUNCTION start: 2nd conserved Cys104 (C) of V:
Figure 3: TRBV aminoacid multiple sequence alignment. 147 "F+ORF+in-frame P with IMGT Gaps" sequences were retreived from IMGT database and were visualized with AliView. Only majority rule consensus aminoacids are highlighted. Conserved Cys104 is highlighted in pink.
JUNCTION end: conserved Phe (F) of the FGXG motif in J:
Figure 4: TRBJ aminoacid multiple sequence alignment. 16 "F+ORF+in-frame P" sequences were retreived from IMGT database. Multiple alignment was performed with Clustal Omega and visualized with AliView. Only majority rule consensus aminoacids are highlighted. FGXG motif is conserved in all but the two ORF J segments.
gapOpening penalty parameter in pairwiseAlignment functionThe aim is to gather representative sequences to find the best alignment parameters that suits the nature of our reads and their sequencing error pattern. Examples:
Example sequences in FASTQ format are retrieved from sample 111H, as it was shown to be a conflictive sample in which approximately 50% of its reads have homopolymers in HVR. Current preprocessing of this sequences include:
CGAT in positions 10-13AACACAGCGACCTCGGGTGGGAACA in positions 14-38Note: after demultiplexing we could save the sequences without the first 38 nucleotides, as this region will not be used in the subsequent alignment steps.
In order to rapidly test some sequences of interest and later validate our proposed pipeline, 3 example sequences with conserved or altered reading frames (RF) were analyzed with a state-of-the-art tool (see below). IMGT/V-QUEST aligns V, D and J segments and delimits JUNCTION (or HVR) boundaries. For V alignment it only takes in account functional (F), open reading frames (ORF) and in-frame pseudogene (P) segments (those 147 segments shown in Figure 3).
Note: We could consider to add alignment of out-of-frame pseudogenes to our pipeline.
For our examples, 4 IMGT/V-QUEST output sections were depicted:
Also, this tool is able to perform an indel correction in V region prior to HVR identification, which could be useful with our Ion Torrent data. Nevertheless, it is only available as a web-based service that admits up to 50 reads per query, so it is not suitable for our massive analysis. There is a high-throughput version called IMGT/HighV-QUEST that analyses up to 150000 sequences per batch, with registration required.
In any case, we can try to implement some ideas from their algorithm, which is not open source, but it is briefly described in the publication Brochet, X., Lefranc, M. P., & Giudicelli, V. (2008). IMGT/V-QUEST: the highly customized and integrated system for IG and TR standardized VJ and VDJ sequence analysis. Nucleic acids research, 36(suppl_2), W503-W508.:
"A new option allows to search for potential insertions and deletions in the user sequence. This search follows the identification of the closest V gene and allele. The program proceeds in two alignment steps [Smith and Waterman algorithm (11)] between the user sequence and the closest V genes and alleles, first looking for insertions, then for deletions. If insertions are detected, they are excluded from the user sequence as they disrupt the IMGT numbering but their positions are memorized so they can be displayed in the results page. If deletions are detected, gaps are entered in the user sequence to restore the IMGT numbering."
Both V and J are identified with fairly high scores and identities. Also, consensus Cys and Phe are found and HVR is in-frame.
Sequence
## RCCGR:00688:05792
## GGGCACCAGGCTCCTCGGCTATGTGGCCCTTTGCCTTCTAGGAGCAGGCCCCCTGGAAGCCCAAGTGACCCAGAACCCAAGATACCTCATCACAGTGACTGGAAAGAAGTTAACAGTGACTTGTTCTCAGAATATGAACCATGAGTATATGTCCTGGTACCGACAAGACCCAGGGCTGGGCTTAAGGCAGATCTACTATTCAATGAATGTTGAGGTGACTGATAAGGGAGATGTTCCTGAAGGGTACAAAGTCTCTCGAAAAGAGAGGAGGAATTTCCCCCTGATCCTGGAGTCGCCCAGCCCCAACCAGACCTCTCTGTACTTCTGTGCCAGCAGTCCCTCGAACACCGGGGAGCTGTTTTTTGGAGAAGGCTCTAGGCTGACCGTACTGGAGGACCTGAAAAACGTGTTCCCACCCGAGGTCGCTGTGTTATCGTTCCTTCTG
Result summary
Closest V-REGIONs
Closest J-REGIONs
Analysis and translation of the JUNCTION
4 nucleotide insertions and 2 nucleotide deletions were detected in V region. The two G deletions were located in a tetramer and in a trimer, whereas the insertions happened in non-homopolymeric regions. IMGT/V-QUEST is able to detect them, delete the insertions and introduce gaps in the deletions for a correct alignment of the V segment (with a 99.27% identity and Cys104 correctly located).
Sequence
## RCCGR:00678:05813
## CTGGCCTGACCCTGCCATGGGCACCAGGCTCCTCTGCTGGGTGGGTCCTGGGGTTTCCTAGGGACAGATCAACACCAGGTGCTGGAGTCTCCCAGTCCCCATAGGTACAAAGTCGCAAAGAGAGGACAGGATGTAGCTCTCAGGTGTGATCCAATTTCGGGTCATGTATCCCTTTTTTGGTACCAACAGGCCCTGGGGCCAGGGGCCAGAGTTTCTGACATTATTTCCAGAATGAAGCTCAACTAGACAAATCGGGCTGCCCAGTGATCGCTTCTTTGCGGAAAGGCCAGAGGATCCGTCTCCACTCTGAAGATCCAGCGCACACAGCAGGGAGGACTCCGCCGTGTATCTCTGTGCCAGCAGACAGGGCACTGGGGCCTACGTCCTGACTTTCGGGGCCGGCAGCAGGCTGACCGTGCTGGAGGACCTGAAAAACGTGTTCCCACCCGAGGTCGCTGTGTTATCGTTCCTTCTG
Result summary: The insertions appear as capital letters in the sequence highlighted in green.
Detailed alignment of top 5 V segments: The two G deletions are highlighted in red.
Closest V-REGIONs
Closest J-REGIONs
Analysis and translation of the JUNCTION
V, J and D are correctly identified and no indels were found in V, but the HVR (containing homopolymers) is out-of-frame. The location of the possible indel is intuited thanks to the assignation of all the HVR nucleotides to V, D and J segments (see the "Analysis and translation of the JUNCTION" section).
Sequence
## RCCGR:00686:05941
## CCCCCAGCTCCTTGGCTATGTGGTCCTTTGCCTTCTACGGAGCAGGCCCCCTGGAAGCCCAAGTGACCCAGAACCCAAGATACCTCATCACAGTGACTGGAAAGAAGTTAACAGTGACTTGTTCTCAGAATATGAACCATGAGTATATGTCCTGGTATCGACAAGACCCAGGGCTGGGCTTAAGGCAGATCTACTATTCAATGAATGTTGAGGTGACTGATAAGGGAGATGTTCCTGAAGGGTACAAAGTCTCTCGAAAAGAGAAGAGGAATTTCCCCCTGATCCTGGAGTCGCCCAGCCCCAACCAGACCTCTCTGTACTTCTGTGCCAGCAGTCCCAGCGGGGGGGCCCGGGGAGCTGTTTTTTGGAGAAGGCTCTAGGCTGACCGTACTGGAGGACCTGAAAAACGTGTTCCCACCCGAGGTCGCTGTGTTATCGTTCCTTCTG
Result summary
Closest V-REGIONs
Closest J-REGIONs
Analysis and translation of the JUNCTION: the frameshift is marked here as being possibly caused by two consecutive deletions, but it can also be caused by a single insertion. All HVR nucleotides are assigned to germline sequences but CCC marked as N-REGION and C marked as P-REGION.
Whereas sequencing error correcting in V and J regions can greatly benefit from an alignment with a reference sequence, that is not the case for the newly introduced nucleotides in the HVR region. In this scenario, one hypothesis could be that sequencing quality values (Phred Score) may be helpful to locate the sequencing error-induced frameshift in HVR.
If we look at the sequencing quality profile of the HVR delimited by V-QUEST in our Example #3, the position with the lowest quality is the first (last in the sequencing process) nucleotide of the guanine heptamer. As V-QUEST determines, this heptamer is present in the segment D2*01 (AGCGGGGGGG) and that may be the reason why it determines that to re-establish the correct reading frame, two extra nucleotides will be required, instead of removing one G, because all the guanines aligned with a germline reference sequence.
There are 3 reading frames, being the correct one (we can call it frame 0) determined by the Cys and Phe codons flanking the HVR. If indels occur between these codons, resulting alterations may be:
Intuitively, it is more likely for one single indel event to occur in this region, than for two. In the Example #3 there is a frameshift to frame 2. With this in mind and the fact that the position with the lowest quality is in a large homopolymer, we may conclude that a G insertion happened during the sequencing process.
Figure 5: Example #3 HVR sequencing quality. Dashed lines mark the end of V and the start of J, respectively. Cys and Phe conserved codons that determine the reading frame 0 are highlighted.
To further support this hypothesis, 42-nts-long (multiple of 3) HVR consensus sequence of reads discarded by MiXCR is a nearly exact match with our Example #3, which is 43 nts long because of that extra guanine nucleotide.
Figure 6: Sequencing quality and consensus sequence of 11216 reads discarded by MiXCR with a 42-nts-long HVR (in-frame).
Pre-processing reads (not shown in this notebook):
Reference sequences alignment:
HVR characterization:
* Some J segments contribute to HVR with a thymine homopolymer of known length (2-6 nts depending on the segment). Alignment of the D segment could also be performed.
Reference sequences are available in the IMGT reference directory.
TRBV sequences are available with or without "IMGT gaps", which are introduced according to the IMGT unique numbering and can be seen in Figure 3. The reason why most of these gaps appear in every TRBV sequence is because the unique numbering system is designed to ease the comparison among V-REGIONS not only from T-cell receptors (TR), but from immunoglobulins (IG) too.
V segments sequences are obtained with IMGT gaps (i.e. aligned) in order to easily locate the conserved Cys that marks the start of the HVR (nucleotides 310-312).
A multiple sequence alignment of all the V segments is shown here (HTML widget created with the function msaR from the R package msaR):
With the J segment alignment we aim to identify the J region, but also to find the Phe codon (TTT or TTC) from the conserved FGXG motif.
Figure 7: FGXG conserved motif nucleotide sequence. N = A, C, G or T; Y = C or T.
Because IMGT does not provide a multiple J segment alignment, TRBJ nucleotide sequences were retrieved and aligned with the function msaClustalOmega from the R package msa in order to easily locate the FGXG (or TTYGGNNNNGGN in nucleotides) motif.
The FGXG motif is located in the positions 23-35 in the alignment. There are two J segments tagged as ORF that do not harbor this motif:
In any case, all the aligned J sequences are trimmed from position 23 to their end. These are the resulting sequences that will be used to locate the J region:
Finally, the multiple sequence alignment gaps are removed prior to our pairwise alignment.
vj_alignment functionThe vj_alignment function performs the V and J alignments in the TCRβ reads to delimit their HVR.
There are 5 main steps: V alignment, Cys codon check, J alignment, Phe codon check and HVR extraction.
1. V alignment: complete V sequences are aligned in this step.
2. Cys check: Cys is located in positions 310-312 in the reference V sequences (w/ gaps), and it is expected to be found in the same place in the alignment (not always the case).
3. J alignment: J segments are aligned only from the Phe codon to easily locate it. (Note: alignment of complete J sequences should be implemented, as well as reproduce the Cys check procedure with the Phe codon)
4. Phe check: first 3 nucleotides in the J alignment are checked.
5. HVR extraction: since Cys and Phe conserved codons are located, we can determine the HVR region.
reads: A QualityScaledDNAStringSet (DNAStringSet containing a XStringQuality object) with the reads.Vsegments: A DNAStringSet with the V segments reference sequences (complete sequences with IMGT gaps).Jsegments: A DNAStringSet with the J segments reference sequences (trimmed from the Phe codon).numCores = detectCores() - 4: Number of cores to execute the function in parallel (one read per core).gapOpening = 1: Penalty for opening a gap in the alignment. It is set to a low value because we expect indels to be frequent in our reads.A data.frame with V and J alignment information and HVR sequence, quality and reading frame. This output is inspired in the Adaptive Immune Receptor Repertoire (AIRR) Rearrangement schema format. It has the following fields:
read_ID: read identifier.read_legnth: read length.V_match: name of best V match.V_score: V alignment score.V_identity: V alignment identity.V_start: V region start coordinate in the read.Cys_check: whether Cys codon is found or not.Cys_status: could be match, pseudomatch, insertion, deletion or mismatch.J_match: name of best J match.J_score: J alignment score.J_identity: J alignment identity.J_end: J region end coordinate in the read.Phe_check: whether Phe codon is found or not.HVR_start: Cys start coordinate in the read.HVR_end: Phe end coordinate in the read.HVR_sequence: HVR nucleotide sequence from Cys to Phe codons, both included.HVR_quality: HVR sequencing quality values.HVR_frame: 0 if Cys and Phe are in the same reading frame, 1 or 2 if they are not.The function pairwiseAlignment, which performs the V and J alignments in steps 1 and 3 of our pipeline, has the following arguments (among others not shown here):
pattern: V or J reference sequences.subject: a single read with its sequencing quality. According to the documentation, sequencing quality is used for a quality-based alignment (see "How does quality-based alignment works?" section for more details).type = "global-local": align whole strings in pattern with consecutive subsequence of subject. gapOpening = 1: default was 10.gapExtension = 4: default value, will tune it in the future.scoreOnly = TRUE / FALSE: whether to return just the scores of the optimal pairwise alignment (quicker) or a PairwiseAlignmentSingleSubject object.To obtain alignment information, the following methods of the PairwiseAlignmentSingleSubject object are used:
Views(x): to obtain match coordinates in subject.score(x): alignment score.pid(x, type = "PID3"): alignment identity. "PID3" calculates the percent sequence identity in the following way: \(100 * (\text{identical positions}) / (\text{length shorter sequence})\). IMGT/V-QUEST identity calculation method corresponds to PID2 (\(100 * (\text{identical positions}) / (\text{aligned positions})\)).deletion(x) and insertion(x): to obtain indel coordinates.Instead of a substitutionMatrix with fixed substitution scores for the alignment, two 94x94 quality-substitution matrices are created prior to the alignment (one for match and another for mismatch) with all the substitution scores for all the possible pairs of Phred Score values.
We can assume that our reference sequences have a good sequencing quality. Each position in the alignment can either be a match or a mismatch (if not an indel):
pattern nucleotide being correctly sequenced and the same nucleotide is present in the subject, the substitution score will be high for most Phred Score values of the subject position, but for 0, in which case the substitution score will be penalized.pattern and subject is different, the substitution score will be highly penalized if the subject sequencing quality is high (true mismatch), but the score will be higher if the sequencing quality is low (mismatch due to sequencing error).Let \(ε_i\) be the probability of an error in the base read. For Phred quality measures \(Q\) in \([0, 93]\), these error probabilities are given by:
\[ε_i = 10^{-Q/10}\]
The combined error probability \(ε_c\) for two positions in two different sequences is:
\[ε_c = ε_1 + ε_2 - (n/(n-1)) ε_1ε_2\]
Where \(n\) is the number of letters in the underlying alphabet (i.e. \(n = 4\) for DNA).
Using \(ε_c\), and \(n = 4\) the substitution score \(s\) is given by:
\[ s = \begin{cases} & b \cdot log_{2}(4(1- \varepsilon_{c})) \text{ for } match \\ & b \cdot log_{2}((4/3) \varepsilon_{c}) \text{ for } mismatch \end{cases}\]
Where \(b\) is the bit-scaling for the scoring.
In this section Example #1 is aligned with its best V match (TRBV27*01_F) with different sequencing quality configurations to compare the alignment scores.
## Score: 365.256286621094
Actually, when no quality values are specified, it is assumed a constant Q = 22 for both subject and pattern and the quality-based substitution matrix is used to calculate the alignment score with the following values for match and mismatch:
## Match: 1.98175613057263
## Mismatch: -5.89928563552943
The reason why Q = 22 when no quality information is provided is stated in the Malde et al. publication:
BLASTN uses a default match reward of 1 and mismatch penalty of -3. This corresponds to comparing sequences with constant quality values of approximately 22. (This is equivalent to a single sequence error rate of 0.63%, and a combined error rate of 1.25%.) For sequences where no quality value is provided, the implementation therefore defaults to a quality value of 22.
Because this information does not appear in the pairwiseAlignment documentation, I checked this by manually establishing the sequencing quality to 22 for both sequences and the alignment score is the same.
subject sequencing quality (pattern Q = 22)## Score: 365.53369140625
This is our current configuration.
subject and pattern (constant Q = 93) sequencing quality## Score: 363.365844726562
Another option would be to manually increase pattern sequencing quality to the maximum (Q = 93). Taking a look at the mismatch quality-based substitution score matrix, increasing the pattern quality would result in more penalized scores if a mismatch has a good sequencing quality in the subject (supporting that it is a true mismatch and not a sequencing error).
The results of the application of the vj_alignment function to our three example reads is shown in the following interactive table, which allows searching, column sorting and downloading the data in .csv or .xls format:
## Time difference of 1.454265 secs
The same results as with IMGT/V-QUEST are obtained. V and J matches and the HVR sequences are identical, Cys and Phe conserved codons are found in all the reads and HVR frameshift in Example #3 is correctly identified. We can visualize the V and J alignments for each read:
V alignment
J alignment
All the indels in V are correctly identified. For a better indel visualization: Vis.elements > Show gap weights.
V alignment
J alignment
V alignment
J alignment
To perform the next step in our proposed pipeline (HVR clustering) we need a pool of sequences to test on. Therefore, 1000 reads from the sample 111H have been aligned against V and J segments to identify enough HVR sequences to perform the clustering (45.63278 seconds, 44 cores).
Ideally, all the reads would have good V and J alignment scores and identities and Cys and Phe codons would be found. However, this is not the case in 26/1000 reads:
For now, these reads are removed from the analysis.
In these reads IMGT/V-QUEST does not find any V or J segments either.
Visualizing two of these V alignments, it is clear that the reads do not harbor a TCRβ sequence, since they are full of gaps and mismatches:
In all these reads, both V score and identity are high, Cys is found, but J score is negative. IMGT/V-QUEST does not find any J segments either.
If we sort the table by V_match we see two different V/J combinations (and HVR sequences):
It seems that only the end of the J sequence is present. Phe is not found and a big gap is introduced at the beginning of the J alignment. However, a Phe codon is present in the read (TTC, positions 15-17 in the alignment) although it is not in the same reading frame as the Cys codon and the HVR region would be too small.
The alignment is poor, but Phe codon is found and the resulting HVR is in frame and has an acceptable length (36 nts, when the average is 42 nts).
In these 4 reads, V and J scores are high, Phe is found, but Cys is missing. The cause is stated in the Cys_status column.
Mismatch: 3 reads with different VJ matches and mismatches in Cys. Most likely, the mismatches are sequencing errors. One example is shown here (TGT > CGT in positions 310-312):
One option to include these reads in the subsequent analyses would be to fix these mismatches with the reference sequences, if they have a low occurrence. If the same mismatch in the conserved Cys codon is frequent in a sample, we could argue whether it is a PCR error or a true (biological) nucleotide substitution. In order to decide what to do with those sequences, a literature review in TCR conserved aminoacids would be necessary (an aminoacid substitution could yield a productive TCR if it is similar enough to Cys).
Deletion: only one read shows a deletion in Cys, located in the first nucleotide (-GT). However, if we take a look at the alignment, we can see that the codon is preceded by a thymine homopolymer in positions 306-310:
Since our reads were transformed to reverse complement, the sequencing direction is the opposite as the alignment direction. If sequencing indels usually occur after homopolymers, in our alignments they should appear before, but they are located at the end because of the difference in sequencing and alignment directions. Possible solutions:
vj_alignment function would need to be revised).The vj_alignment function is able to detect insertions in the conserved Cys codon and fix them, since we assume that they are the most frequent sequencing error type in the Ion Torrent technology. Here are the 5 reads whose Cys codon was fixed and one example with its V match aligned (insertion in 311):
There is only one read with good V and J scores, found Cys, but no Phe codon.
The J alignment (from Phe) shows an insertion and a mismatch in the first codon:
However, if we align the whole J segment, we can see two mismatches in Phe (TTT > CTC, positions 19-21):
Same alignment method as for V segment (alignment with complete reference sequences) would correctly identify these cases, but here, either by an insertion and a mismatch or by two mismatches, the Phe codon is lost anyway.
To optimize the clustering step we could group duplicated reads, or potential clonotypes, with the same V and J matches and HVR sequence. In our sample of 974 reads there are only 195 unique combinations of V-HVR-J:
Not all of the HVR sequences are in frame 0 (Cys and Phe codons in the same reading frame). Since we do not have any reference sequence for correcting indels in HVR, a sequence clustering would help to identify the errors and correct them to reestablish the correct reading frame.
The distribution of frames in our 195 unique V-HVR-J combinations is the following:
Most of the HVR are in frame, but for the out of frame ones, frame 2 (one insertion) is more frequent than frame 1 (one deletion), which was expected since we know that Ion Torrent is more prone to insertions than deletions (Figure 1).
Since we are aiming for sequencing error correction, it makes sense to perform the clustering among sequences with the same V and J match. In our filtered sample with 195 unique potential clonotypes, there are a total of 53 different VJ combinations or VJ families:
We need at least two different clonotypes in a VJ family to identify and correct sequencing errors. Therefore, the following clustering steps will not be applied to all the families with only one clonotype. Those 27 clonotypes that constitute a VJ family for themselves are shown here:
Only 4 of them have an out of frame HVR, and we do not have any other HVR to compare for error correction, so they would be considered unproductive sequences. (Note: These VJ families with only one member could have more clonotypes in the rest of the 111H sample.)
This filtering leaves us with 26 different VJ families with more than one unique clonotype.
The largest VJ family (V27*01 / J2-2*01, 46 different HVR sequences) will be used from now on:
First, let's perform a multiple sequence alignment with ClustalW to visualize the HVR sequences. If we sort it by identity (Sorting > Identity), 3 major groups can be seen:
## use default substitution matrix
The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. If we calculate this distance between our reads, we obtain the following distance matrix that, again, shows a pattern of three groups:
Once we have calculated the pairwise distance between our HVR sequences, clustering is the next step. Since we only have a distance matrix and not "points X dimensions" data, we can only execute distance-based algorithms like the following:
https://stats.stackexchange.com/questions/2717/clustering-with-a-distance-matrix
The following dendrogram was generated from a hierarchical clustering made with the function hclust (complete agglomeration method). Then, the tree was divided in three groups:
The red rectangles delimit our three clusters of unique HVR sequences that allegedly come from three different true clonotypes.
To confirm that three groups (\(k = 3\)) is the optimal division, we could calculate the average silhouette width, \(S\), with different values of \(k\). \(S\) takes its highest value with the optimal \(k\), so we can confirm that, according to this measure, three clonotypes is the optimal grouping in this VJ family of 46 unique HVRs:
To sum up all the previous information, in the following figure we show the dendrogram, the HVR multiple sequence alignment and the frame and abundance of each unique HVR:
In every cluster, the most frequent HVR is in frame, so it could be a candidate to be the representative sequence of its cluster:
Now that the clustering of a VJ family is well understood, let's perform a hierarchical clustering with Levenshtein distance for each of the 26 VJ families in our 974 reads sample:
ConsensusClusterPlus TO-DOTwo error correction algorithms combined, Pollux (publication, software) and Fiona (publication, software) were tested with V27*01 / J2-2*01 family HVRs.
The algorithm choice was based on the results of the publication Song, Liting, et al. "Comparison of error correction algorithms for Ion Torrent PGM data: application to hepatitis B virus." Scientific reports 7.1 (2017): 1-11.:
Since Pollux has a greater performance for indel error correction and Fiona has a greater power for distinguishing ‘genuine’ substitutions from sequencing errors, we suggest the combined use of Pollux and Fiona for Ion Torrent PGM data.
These algorithms take in account the frequency of the reads*. Therefore, we do not need to group identical reads beforehand.
Pollux approaches the problem of error correction conservatively, removing little information and requiring a high degree of confidence to make a correction. It first scans across all reads, divides reads into k-mers, and counts the number of occurrences of each observed k-mer. It then scans reads a second time, generates a k-mer depth profile for each read, and uses this information to correct the k-mer profile.
All the 321 HVR sequences of the V27*01 / J2-2*01 family were submitted in FASTQ format (although sequencing quality is not used by Pollux). Default k-mer length is 31, but it is thought to work with at least 150-nt-long reads and no correction is performed in our HVR sequences with this value. For our sequences, whose mean length is 42 nt, a k-mer length of 9 seems to be appropiate.
The result of a hierarchical clustering (with Levenshtein distance) of the Pollux-corrected reads is shown here:
## use default substitution matrix
Unique clonotypes were reduced from 46 to 16, but many species containing mismatches are preserved.
Fiona uses a suffix tree to detect and correct substitution and indel errors, and use edit distance comparisons to enhance overlap detection of indel errors.
Output FASTQ file generated by Pollux is used as Fiona input. Genome size must be specified. A value of 42 nt was used since it is the mean HVR length.
The result of a hierarchical clustering (with Levenshtein distance) of the Pollux-Fiona-corrected reads is shown here:
## use default substitution matrix
Unique clonotypes were reduced from 16 to 9 and only one in frame species remained in each cluster.
It is remarkable that all the out of frame HVRs that were not corrected by any algorithm have an insertion or deletion in the thymine homopolymer. Maybe its location (at the end of the sequence) makes it difficult for these indels to be corrected.
Now, let's apply this three-step pipeline to all the VJ families:
Most HVR sequences are in frame after error correction:
##
## 0 1 2
## 750 49 126
##
## 0 1 2
## 851 35 39
Deprecated ideas will be stored in this section.
Deprecated: One pairwiseAlignment call per V segment takes ~22s/read.
seq_v_alignment function:
Test with Example #1:
## Best match: TRBV27*01_F Score: 329.533721923828 Conserved Cys FOUND at position 326 (TGT)
## Time difference of 25.71661 secs
Deprecated: Slower than lauching a single pairwise alignment with 147 patterns.
Because there are 147 annotated V segments, pairwise alignment can take a while to run in all of our reads. One approach to make this process more efficient could be to establish a certain order for the alignments. IMGT nomenclature for V genes has the following structure:
TRBVXX-Y*ZZ
Where:
X represents the subfamily.Y is the gene number in the subfamily.Z is the allele number.In our previous work, we aligned only the XX-Y*01 sequences (65 V segments), but we could go one step further and perform a sequential pipeline to avoid unnecesary V alignments:
XX-1*01, i.e. one segment per subfamily (total of 28 subfamilies, as TRBV8 and TRBV22 are out-of-frame pseudogenes)XX-Y*01 sequences in that subfamily to identify the gene.XX-Y*ZZ of that gene.To check the sequence homology among V genes in a subfamily, I performed a multiple sequence alignment of all the V segments (trimmed up to the conserved Cys, since we will use this sequences for the alignment) with the function msaClustalOmega from the R package msa and plotted a cladogram with the R package ggtree.
## using Gonnet
All the genes of the same subfamily cluster together, so we can assume that it is safe to pick only one gene of the subfamily to identify it correctly.
hier_v_alignment function:
Test with Example #1:
## Best match: TRBV27*01_F Score: 329.533721923828 Conserved Cys FOUND at position 326 (TGT)
## Time difference of 4.971066 secs
v_alignment and j_alignment functionsDeprecated: The vj_alignment function combine these two.
The v_alignment function does the following:
scoreOnly = TRUE) and get the one with the best score.scoreOnly = FALSE).PairwiseAlignmentSingleSubject object.To better visualize the alignment, the function pwa_vis was created, which is based on the msaR function from the msaR R package to produce an interactive HTML widget output.
The three examples show the same V match as the one provided by IMGT/V-QUEST.
## Best match: TRBV27*01_F
## Score: 329.533721923828 Identity: 86.8589743589744
## Conserved Cys FOUND at position 326 (TGT)
## Time difference of 0.4073238 secs
All the V indels are correctly identified.
## Best match: TRBV7-8*01_F
## Score: 266.789489746094 Identity: 87.1794871794872
## Conserved Cys FOUND at position 353 (TGT)
## Time difference of 0.3908341 secs
## Best match: TRBV27*01_F
## Score: 342.853149414062 Identity: 87.5
## Conserved Cys FOUND at position 324 (TGT)
## Time difference of 0.382462 secs
In order to facilitate the J alignment, we could remove the already matched V region from our reads and align the reference J segments in the remaining sequence.
The function j_alignment performs the alignment of the trimmed J segments and check the Phe codon presence.
In our examples the J alignment identity values are not the same as IMGT/V-QUEST because we are aligning only from Phe, whereas it aligns the whole J segment.
## Best match: TRBJ2-2*01_F
## Score: 61.562313079834 Identity: 100
## Conserved Phe FOUND at position 362 (TTT)
## Time difference of 0.2756319 secs
## Best match: TRBJ2-6*01_F
## Score: 61.5678253173828 Identity: 100
## Conserved Phe FOUND at position 392 (TTC)
## Time difference of 0.2527661 secs
## Best match: TRBJ2-2*01_F
## Score: 61.6492309570312 Identity: 100
## Conserved Phe FOUND at position 364 (TTT)
## Time difference of 0.2454381 secs
vj_alignment function with V segments trimmed up to Cys codonDeprecated: Complete V segments, unlike trimmed ones, help with the correct indel and mismatch identification in Cys codon.
Two pairwise alignments for each region (two for V and two for J) are performed with the pairwiseAlignment function from the R package Biostrings. This function’s computation time is proportional to the product of the two string lengths being aligned.
First, all the reference segments are aligned in a quick way to get the alignment scores and identify the best match. Second, the best segment is aligned again, but returning an instance of class PairwiseAlignmentSingleSubject which has methods to obtain information about match coordinates, indels and substitutions.
The function pairwiseAlignment has the following arguments (among others):
pattern: a DNAStringSet containing all the V segments trimmed up to the conserved Cys or the J segments trimmed from the conserved Phe.subject: a QualityScaledDNAStringSet (DNAStringSet containing a XStringQuality object) of length 1, i.e. a single read with its sequencing quality. According to the documentation: "Objects of class XStringQuality representing the respective quality scores for pattern and subject are used in a quality-based method for generating a substitution matrix".type: global-local (align whole strings in pattern with consecutive subsequence of subject). Since we are trying to find the conserved Cys and Phe codons in the read, forcing to align the whole reference sequence will ensure that these codons will be correctly aligned, if present.gapOpening: 1 (default was 10)gapExtension: 4 (default, will tune it in the future)scoreOnly: return just the scores of the optimal pairwise alignment. TRUE at first, FALSE when the best match is already identified.We will use the following methods of the PairwiseAlignmentSingleSubject object:
Views(x): to obtain match coordinates in subject.score(x): alignment score.pid(x, type = "PID3"): alignment identity. "PID3" calculates the percent sequence identity in the following way: \(100 * (\text{identical positions}) / (\text{length shorter sequence})\). IMGT/V-QUEST identity calculation method corresponds to PID2 (\(100 * (\text{identical positions}) / (\text{aligned positions})\)).Note: we could also obtain indel and mismatch information with the indel(x) and summary(x) methods, respectively.
Once the V and J segments are aligned, we can check the presence of the Cys and Phe codons that flank the HVR, obtain the HVR nucleotide sequence and sequencing quality and determine its reading frame.
The described alignment pipeline is implemented in the vj_alignment function, which performs the alignments in parallel (one read per core). The output is inspired in the Adaptive Immune Receptor Repertoire (AIRR) Rearrangement schema format.
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] RColorBrewer_1.1-2 ape_5.4-1 seqinr_4.2-4
## [4] dplyr_1.0.2 htmlwidgets_1.5.2 plyr_1.8.6
## [7] bsselectR_0.1.0 stringr_1.4.0 ggtree_2.0.4
## [10] factoextra_1.0.5 ggpubr_0.4.0 ggraph_2.0.2
## [13] tidyr_1.1.2 tibble_3.0.4 msa_1.16.0
## [16] msaR_0.3.0 ggplot2_3.3.2 Biostrings_2.52.0
## [19] XVector_0.24.0 IRanges_2.18.3 S4Vectors_0.22.1
## [22] BiocGenerics_0.30.0
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.4-1 ggsignif_0.5.0
## [3] hwriter_1.3.2 ellipsis_0.3.1
## [5] rio_0.5.16 GenomicRanges_1.36.1
## [7] farver_2.0.3 graphlayouts_0.7.1
## [9] ggrepel_0.8.1 DT_0.16
## [11] robustbase_0.93-6 knitr_1.30
## [13] polyclip_1.10-0 ade4_1.7-16
## [15] jsonlite_1.7.1 Rsamtools_2.0.3
## [17] broom_0.7.6 cluster_2.1.0
## [19] png_0.1-7 ggforce_0.3.1
## [21] BiocManager_1.30.10 compiler_3.6.1
## [23] rvcheck_0.1.8 backports_1.2.0
## [25] Matrix_1.2-17 lazyeval_0.2.2
## [27] tweenr_1.0.1 htmltools_0.5.1.1
## [29] prettyunits_1.1.1 tools_3.6.1
## [31] igraph_1.2.6 gtable_0.3.0
## [33] glue_1.4.2 GenomeInfoDbData_1.2.1
## [35] reshape2_1.4.4 ShortRead_1.42.0
## [37] Rcpp_1.0.5 carData_3.0-4
## [39] Biobase_2.44.0 cellranger_1.1.0
## [41] jquerylib_0.1.3 vctrs_0.3.8
## [43] nlme_3.1-140 crosstalk_1.1.0.1
## [45] xfun_0.19 openxlsx_4.2.3
## [47] lifecycle_0.2.0 rstatix_0.7.0
## [49] DEoptimR_1.0-8 zlibbioc_1.30.0
## [51] MASS_7.3-51.1 scales_1.1.1
## [53] tidygraph_1.2.0 hms_0.5.3
## [55] SummarizedExperiment_1.14.1 yaml_2.2.1
## [57] curl_4.3 gridExtra_2.3
## [59] sass_0.3.1 latticeExtra_0.6-29
## [61] stringi_1.5.3 tidytree_0.3.3
## [63] zip_2.1.1 BiocParallel_1.18.1
## [65] GenomeInfoDb_1.20.0 rlang_0.4.11
## [67] pkgconfig_2.0.3 bitops_1.0-6
## [69] matrixStats_0.57.0 evaluate_0.14
## [71] lattice_0.20-38 purrr_0.3.4
## [73] GenomicAlignments_1.20.1 treeio_1.8.2
## [75] labeling_0.4.2 cowplot_1.1.0
## [77] tidyselect_1.1.0 magrittr_1.5
## [79] R6_2.5.0 generics_0.1.0
## [81] DelayedArray_0.10.0 pillar_1.4.6
## [83] haven_2.3.1 foreign_0.8-72
## [85] withr_2.3.0 abind_1.4-5
## [87] RCurl_1.98-1.2 crayon_1.3.4
## [89] car_3.0-10 rmarkdown_2.7
## [91] viridis_0.5.1 jpeg_0.1-8.1
## [93] progress_1.2.2 grid_3.6.1
## [95] readxl_1.3.1 data.table_1.13.2
## [97] forcats_0.5.0 digest_0.6.27
## [99] munsell_0.5.0 viridisLite_0.3.0
## [101] bslib_0.2.4